Introduction

The purpose of this report is to document quality-control and data transformation procedures followed to obtain methylation values (\(\beta\) and M) for the adolescent vaping study. Much of the code for this report had already been produced by the previous analyst, Cuining Liu. Further steps were taken to dissect the data-pipeline and to support any decisions made moving forward.

Methods

Methylation samples were taken using the 850K ‘EPIC’ array. All quality control steps were conducted using the package SeSAMe ver. 1.15.7. The steps below are not necessarily presented in chronological order which they were taken.

Sample Outliers

Samples will be evaluated for a low mean-intensity to identify potential outliers with low-quality methylation values.

Probe Removal

It is common to remove experimental probes from the data to preserve the biologic variability in the dataset. Probes can be removed from the data set for two reasons:

  1. Experiment-independent Probe Masking: Probes that are masked due to non-unique mapping or influence by SNPs.
  2. Experiment-dependent Probe Masking: Probes that are masked based on the detection p-value, which represents the probability of a detection signal being background flourescence. Masks are based on the detection p-value, which represents the probability of a detection signal being background fluorescence (Zhou et al. 2018).

Experiment-independent probe masking is set by a pre-determined list of probes specific to the ‘EPIC’ array. Experiment -dependent probe masking will be determined by a p-value threshold. CpG probes with a detection p-value > 0.05 in at least 10% of samples will be masked for the purpose of this study.

Data Transformations

There are several data transformations implemented within the sesame pipeline in order to minimize variability from technical effects while still maintining biologic variability in the data. The first is a dye bias correction. Dye bias refers to bias in the methylation values due to the performance of the red and green dyes that create flourescence and are then interpreted as methylation values. There are several corrections that can be applied to make the red and green values more comparable. A non-linear dye bias correction will be presented here.

The next transformation within the sesame pipeline is background subtraction. The purpose of background subtraction is to align the distribution of beta values for Infinium I and II probes in order to make them more comparable. The default method for sesame is a normal-exponential deconvolution using out-of-band probes or noob (Triche et al. 2013).

BMIQ normalization was considered a priori due to a concern for lack of within-array normalization as interpreted from diagnostic plots. BMIQ works in conjunction with noob to correct type II probe bias (Liu and Siegmund 2016).

Beta and M-Value Distributions

The product of the sesame data pipeline is a matrix of beta values. Beta values will be converted to M-values using the logit transformation

\[M = log_2(\frac{\beta}{1 - \beta})\]

Visualizations will ensure the proper distribution of \(\beta\) and M-values, respectively.

Cell Composition Estimation and RUVm

A standard part of methylation pre-processing is to estimate cell composition for the particular sample type acquired because methylation profile can vary greatly across different cell types. Cell composition can be estimated using reference sets (e.g. cord blood has a known cell composition profile), but reference free methods do exist for samples that do not have a known profile. In the case of this study, nasal-epithelial tissue was collected and does not have a known cell profile. Due to the small sample size of this study, there is a concern that reference free cell composition estimates will not be robust to outliers or variability due to technical effects.

In light of our concern, this analysis will use Removal of Unwanted Variability for Methylation (RUVm) implemented in missMethyl ver. 1.31.0. The benefit of using this method is that it will be more robust to the small sample size. The pitfall is that we can not specify what technical variability (e.g. cell type) is captured by this transformation.

The biological variability preserved by this method can be represented by:

\[Beta \ Value = vape_status + age + sex\]

Note that recruitment center is excluded from this model because it is likely confounding vape status (see ‘teen_vape_exploratory_report_yyyy_mm_dd.html’). RUVm could also be removing variability due to recruitment center.

Sample Clustering

Visualizations of clustering by sex, recruitment center, and vape status will help to detect any technical effects that need to be accounted for in downstream analyses. CpG sites chosen for these plots are the top 1000 or 10000 most variable sites.

Results

#Samples with no RNA or Meth
no_rna_meth <- metadata_sex %>% 
  filter(is.na(rna_id) & is.na(methylation_id)) %>% 
  .[,"sid"] %>% as.list()
#Samples with no Meth only
no_meth <- metadata_sex %>% 
  filter(!is.na(rna_id) & is.na(methylation_id)) %>% 
  .[,"sid"] %>% as.character()
#Sample without vape status
no_vape_sid <- metadata_sex %>% 
  filter(is.na(vape_6mo_lab)) %>% 
  .[,"sid"] %>% as.character()
#Sample without vape statu
no_vape_sentrix <- metadata_sex %>% 
  filter(is.na(vape_6mo_lab)) %>% 
  .[,"sentrix_name"] %>% as.character()

#Fix betas and mvals
betas <- betas[,colnames(betas) != no_vape_sentrix]
mvals <- mvals[,colnames(mvals) != no_vape_sentrix]
betas_auto <- betas_auto[,colnames(betas_auto) != no_vape_sentrix]

#Label column for plots 
metadata_sex <- metadata_sex %>% 
  mutate(vape_text = if_else(vape_6mo_lab == "Vaped in Last 6 Months", "Vaped", "Not Vaped"))

Of the 51 subjects surveyed for this project, methylation data were collected for 48. Two of the three subjects (SID 105 & 137) for whom there was no transcript data also had no methylation data. One sample (SID 102) included in the RNASeq Analysis known as “Sample 12” also had missing methylation data. This sample was the center of a sensitivity analysis for inclusion in RNASeq analyse (see “sample12_sensitivity_report_2022_mm_dd.html”), and it should be noted that this subject will not be available for comparison. One sample (SID 144) lacked vape status in clinical metadata and was removed after obtaining beta values for all subjects in order to retain the benefits of that sample for normalization purposes. This sample was also excluded from the RNASeq analysis. Downstream analyses will include 47 samples.

#Get all samples missing data
no_dat_samps <- metadata_sex %>% 
  filter(is.na(rna_id) | is.na(methylation_id)) %>% 
  select(sid, rna_id, methylation_id) %>% 
  rename(rna_id = "RNASeq ID",
         methylation_id = "Methylation ID",
         sid = "Subject ID")
#set option
options(knitr.kable.na = 'No Data')
#make kable table
no_dat_samps %>%
  mutate(`RNASeq ID` = if_else(is.na(`RNASeq ID`), "No Data", `RNASeq ID`),
         `Methylation ID` = if_else(is.na(`Methylation ID`), "No Data", `Methylation ID`))%>% 
  kablize()
Subject ID RNASeq ID Methylation ID
102 Sample12 No Data
105 No Data No Data
137 No Data No Data
#Filter out necessary samples from metadata
metadata_sex <- metadata_sex %>% 
  filter(!is.na(rna_id) , !is.na(methylation_id), !is.na(vape_6mo_lab))

Outlier Samples

low_qual_id <- as.character(methylation_qc[methylation_qc$outlier == T, "sid"])

SID 127 falls in the bottom 1% of mean signal intensity readings.

Figure 1: Sample Quality Outliers

## Plot Outliers ##
methylation_qc %>% 
  ggplot(aes(y = log2(mean_intensity))) +
  geom_boxplot() +
  geom_text(data = methylation_qc[methylation_qc$outlier == T, ], aes(y = log2(mean_intensity), x = -0.03, label = methylation_qc[methylation_qc$outlier == T, ]$sid), col = "red")+
  geom_hline(aes(yintercept = log2(quantile(mean_intensity, probs = 0.01)), col = "1st Percentile"), linetype = "dashed") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        legend.position = "bottom") +
  scale_color_manual(values = c("red")) +
  labs(y = "log2(Mean Intensity)",
       color = NULL)

Considering the small sample size in this experiment, it is advisable not to remove SID 127, but it is worth noting for downstream analyses.

Probe Masking

#non-experimental probes removed
raw_idat_example <- readIDATpair(here("DataRaw/methylation/RawIdat/Slide_1_205310770060/205310770060_R01C01"))

non_exp_probes_rm <- qualityMask(raw_idat_example)$mask %>%
  sum() %>% as.numeric()

#experimental probes removed
exp_probes_rm <- as.numeric(length(raw_idat_example$Probe_ID)) - dim(betas)[1] - non_exp_probes_rm #total probes - probes kept - nonexp probes removed

#total probes removed
total_probe_rm <- non_exp_probes_rm + exp_probes_rm

105,454 CpG sites were removed by non-experimental probe masking and 14,349 were removed by experimental probe masking (detection p-value > 0.05 in \(\ge\) 10% of samples). 119,803 probes were removed in total. 746750 probes were included for the analyses presented in this report.

Data Transformations

Some of the normalization procedures intended for this analysis performed poorly in diagnostic plots. For that reason, several variations of the sesame and minfi normalization pipelines were considered and compared. A separate report is available that details all of the normalization pipelines that were considered called “normalization_method_compare_yyyy_mm_dd”. A high-level summary of some of the methods compared in that report will be described here. For the purpose of this report, the following pipeline was used:

\[ non-experimental \ probe \ removal \rightarrow InferInfiniumIProbes \rightarrow experimental \ probe \ removal \rightarrow \\ noob \rightarrow BMIQ \rightarrow non-linear \ dye \ bias \ correction\]

Figure 2 visualizes dye bias for the method ‘dbNL + noob + BMIQ’ in which dye-bias correction comes before the other transformations, and ‘noob + BMIQ + dbNL’ where dye-bias correction come after.

Figure 2: Dye Bias Correction

In summary, the dye bias correction performs well when only ‘noob’ is used as a normalization process. Adding BMIQ to the normalization pipeline induces dye-bias back into the plots (plots in the second row perform poorly). Moving the dye-bias correction to the end of the normalization pipeline seems to correct this issue. Table 1 attempts to quantify the performance of each method. Each value represents the proportion of probes that fell within the mean distance of the ‘x = y’ line, using the mean distance of the raw plots.

db_eval <- read_csv(here("DataProcessed/methylation/dye_bias_quality_check.csv")) %>% as.data.frame()

rownames(db_eval) <- c("Raw", "dbNL + Noob + BMIQ", "Noob + BMIQ + dbNL")

db_eval %>% kablize()
205310770093_R03C01 205003700122_R08C01 205310760169_R07C01
Raw 0.373 0.468 0.350
dbNL + Noob + BMIQ 0.436 0.426 0.434
Noob + BMIQ + dbNL 0.983 0.978 0.981

Table 2 concludes that ‘noob + BMIQ + dbNL’ has the lowest proportion of sites that fall outside of the mean difference.

Figure 3 shows the distribution of Infinium I and II beta values for three randomly selected samples across the two different methods.

Figure 3: Sample Beta Value Distribution for Infinium I and II Probes

It is clear from figure three that the Infinium I beta value distribution is shifted and has better allignment with Infinium II probes, making them more comparable. Moving dye bias correction to the end of the pipeline induces some noise around \(\beta = 0.7\), but is acceptable for the added benefit of limited dye-bias. The ‘noob + BMIQ + dbNL’ method will be used moving forward.

Beta and M-Value Distributions

Overall, \(\beta\) and M-values followed their expected distributions with peaks near 0 (unmethylated sites) and 1 (methylated sites) in the beta distribution.

Figure 4: \(\beta\) and M-Value Distributions

Figure 4 indicates some noise around \(\beta\) = 0.5 or \(M\) = 0, but falls within acceptable levels.

Clustering by Sex

Visualization of the samples by median X and Y chromosome intensities helps to identify samples with poor quality or samples whose clinical sex do not match their predicted sex base on their methylation profile. Figure 5 displays plots of median intensity for X and Y chromosomes color-coded for both clinical and predicted sex.

Figure 5: Sex Verification

#Clinical Sex
clin_sex_plot <- metadata_sex %>% 
  ggplot(aes(x = log2(medianX), y = log2(medianY), col = sex_lab)) +
  geom_text(aes(label = sid)) +
  labs(col = "Sex") +
  theme(legend.position = "bottom") +
  scale_color_manual(values = c("deepskyblue", "deeppink3"))

#Predicted Sex
predicted_sex_plot <- metadata_sex %>% 
  ggplot(aes(x = log2(medianX), y = log2(medianY), col = pred_sex)) +
  geom_text(aes(label = sid)) +
  labs(col = "Predicted Sex") +
  theme(legend.position = "bottom")+
  scale_color_manual(values = c("deepskyblue", "deeppink3"))

ggarrange(clin_sex_plot, predicted_sex_plot, nrow = 1, ncol = 2, common.legend = T)

Figure 5 demonstrates sample clustering by sex for median X and Y intensities. Clinical sex matches predicted sex for all samples. It should be noted that SID 122 recorded ‘non-binary’ as clinical sex, and sex was inferred by methylation data for other portions of this project.

Sample Clustering

#convert cpgsites back to rownames
mds_prep <- function(dat) {
  dat <- as.data.frame(dat)
  rownames(dat) <- dat[,"CpG_Site"]
  new_dat <- dat %>% 
    select(-CpG_Site) %>% 
    as.matrix()
  return(new_dat)
}

#Allow MDS to plot more MDS coordinates
mdsPlot_n <- function(dat, numPositions = 1000, sampNames = NULL,
                    sampGroups = NULL, xlim, ylim, pch = 1,
                    pal = brewer.pal(8, "Dark2"), legendPos = "bottomleft",
                    legendNCol, main = NULL, n_mds = 2) {
    # Check inputs
    if (is(dat, "MethylSet") || is(dat, "RGChannelSet")) {
        b <- getBeta(dat)
    } else if (is(dat, "matrix")) {
        b <- dat
    } else {
        stop("dat must be an 'MethylSet', 'RGChannelSet', or 'matrix'.")
    }
    if (is.null(main)) {
        main <- sprintf(
            "Beta MDS\n%d most variable positions",
            numPositions)
    }

    o <- order(rowVars(b), decreasing = TRUE)[seq_len(numPositions)]
    d <- dist(t(b[o, ]))
    fit <- cmdscale(d, k = n_mds)
    if (missing(xlim)) xlim <- range(fit[, n_mds - 1]) * 1.2
    if (missing(ylim)) ylim <- range(fit[, n_mds]) * 1.2
    if (is.null(sampGroups)) sampGroups <- rep(1, numPositions)
    sampGroups <- as.factor(sampGroups)
    col <- pal[sampGroups]
    if (is.null(sampNames)) {
        plot(
            x = fit[, n_mds - 1],
            y = fit[, n_mds],
            col = col,
            pch = pch,
            xlim = xlim,
            ylim = ylim,
            xlab = paste0("MDS ", as.character(n_mds - 1)),
            ylab = paste0("MDS ", as.character(n_mds)),
            main = main)
    } else {
        plot(
            x = 0,
            y = 0,
            type = "n",
            xlim = xlim,
            ylim = ylim,
            xlab = paste0("MDS ", as.character(n_mds - 1)),
            ylab = paste0("MDS ", as.character(n_mds)),
            main = main)
        text(x = fit[, n_mds - 1], y = fit[, n_mds], sampNames, col = col)
    }
    numGroups <- length(levels(sampGroups))
    if (missing(legendNCol)) legendNCol <- numGroups
    if (numGroups > 1) {
        legend(
            x = legendPos,
            legend = levels(sampGroups),
            ncol = legendNCol,
            text.col = pal[seq_len(numGroups)])
    }
}

plotCpg <- function(dat, cpg, pheno, type = c("categorical", "continuous"),
                    measure = c("beta", "M"), ylim = NULL, ylab = NULL,
                    xlab = "", fitLine = TRUE, mainPrefix = NULL,
                    mainSuffix = NULL) {
    if (is.numeric(cpg)) cpg <- rownames(dat)[cpg]
    type <- match.arg(type)
    measure <- match.arg(measure)
    if (is(dat, "MethylSet") || is(dat, "RGChannelSet")) {
        if (measure == "beta") {
            # NOTE: as.matrix() necessary in case 'dat' is a
            #       DelayedArray-backed minfi object
            x <- as.matrix(getBeta(dat[cpg, ]))
            if (is.null(ylab)) ylab <- "Beta"
            if (is.null(ylim)) ylim <- c(0, 1)
        } else if (measure == "M") {
            x <- getM(dat[cpg, ])
            if (is.null(ylab)) ylab <- "M"
            if (is.null(ylim)) ylim <- range(x)
        }
    } else {
        if (is.null(ylab)) ylab <- "unknown"
        x <- dat
        if (is.vector(x)) {
            x <- matrix(x, ncol = 1)
        } else {
            x <- as.matrix(x)
        }
    }
    main <- paste(mainPrefix, cpg, mainSuffix)
    names(main) <- cpg
    for (i in cpg) {
        if (type == "categorical") {
            stripchart(
                x = x[i, ] ~ pheno,
                vertical = TRUE,
                method = "jitter",
                jitter = 0.15,
                ylab = ylab,
                ylim = ylim,
                main = main[i])
        } else if (type == "continuous") {
            plot(
                x = pheno,
                y = x[i,],
                ylab = ylab,
                xlab = xlab,
                ylim = ylim,
                main = main[i])
            if (fitLine) abline(lm(x[i, ] ~ pheno), col = "blue")
        }
    }
}
#Code to prep data and make mds plots
make_mds_plots_n <- function(vals, meta, plot = c("sex", "vape", "center"), nump = 1000, lim = c(-10, 10), mlim = c(-50, 50), n_mds = 2){
  #Get only samples with methylation data and arrange in same order as betas so that you can use for mds plots
  meta <- meta %>% 
    filter(sentrix_name %in% colnames(vals)) %>% 
    arrange(factor(sentrix_name, levels = colnames(vals)))
  
  #Prep values for mds plots
  vals <- mds_prep(vals)
  
  #Get mvals
  mvals <- BetaValueToMValue(vals)
  
  # Set Color Pallattes
  sex_pal <- c("deepskyblue", "deeppink3")
  center_colors <- c("cyan3", "chartreuse3", "darkorange")
  
  
  #MDSPlots (Betas)
  par(mfrow = c(1,2))
  if ("sex" %in% plot) {
  #Clustering by Sex
  mdsPlot_n(vals, sampGroups = meta$sex_lab, sampNames = meta$sid,
          numPositions=nump, ylim = lim, main = paste0("Sex - Betas - ", as.character(nump)), 
          pal = sex_pal,
          n_mds = n_mds)
  
  mdsPlot_n(mvals, sampGroups = meta$sex_lab, sampNames = meta$sid,
          numPositions=nump, ylim = mlim, main = paste0("Sex - M-values - ", as.character(nump)), 
          pal = sex_pal,
          n_mds = n_mds)
  }
  if ("vape" %in% plot) {
  #Clustering by Vape Status
  mdsPlot_n(vals, sampGroups = meta$vape_text, sampNames = meta$sid,
          numPositions=nump, ylim = lim, main = paste0("Vape Status - Betas - ", as.character(nump)),
          n_mds = n_mds)
  
  mdsPlot_n(mvals, sampGroups = meta$vape_text, sampNames = meta$sid,
          numPositions=nump, ylim = mlim, main = paste0("Vape Status - M-values - ", as.character(nump)),
          n_mds = n_mds)
  }
  if ("center" %in% plot) {
  #Clustering by center
  mdsPlot_n(vals, sampGroups = meta$recruitment_center, sampNames = meta$sid,
          numPositions=nump, 
          ylim = lim, 
          pal = center_colors, 
          main = paste0("Recruitment Center - Betas - ", as.character(nump)),
          n_mds = n_mds)
  
  mdsPlot_n(mvals, sampGroups = meta$recruitment_center, sampNames = meta$sid,
          numPositions=nump, ylim = mlim, pal = center_colors, 
          main = paste0("Recruitment Center - M-values - ", as.character(nump)),
          n_mds = n_mds)
  }
}

For each feature of interest, MDS plots were made using a) probes from the whole genome and b) only autosomal probes. Using only autosomal probes removes the inherent separation by sex to get a better idea of clustering patterns.

Figure 6: Sample Clustering (Sex)

All Chromosomes (Including X and Y)

#Whole Genome
make_mds_plots_n(betas, metadata_sex, plot = "sex", nump = 1000, mlim = c(-20, 20))

make_mds_plots_n(betas, metadata_sex, plot = "sex", nump = 10000, lim = c(-40, 40), mlim = c(-100, 100))

Autosomes Only

#Autosomal
make_mds_plots_n(betas_auto, metadata_sex, plot = "sex", lim = c(-10, 10), mlim = c(-50, 50))

make_mds_plots_n(betas_auto, metadata_sex, plot = "sex", lim = c(-10, 10), nump = 10000)

When retaining the X and Y chromosomes, the samples cluster by sex, as expected. When looking only at the autosomal chromosomes, there is no additional clustering that needs to be accounted for.

Figure 7: Sample Clustering (Vape Status)

All Chromosomes

make_mds_plots_n(betas, metadata_sex, plot = "vape", nump = 1000)

make_mds_plots_n(betas, metadata_sex, plot = "vape", nump = 10000, lim = c(-10, 10), mlim = c(-100, 100))

Autosomes Only

#Autosomal
make_mds_plots_n(betas_auto, metadata_sex, plot = "vape", lim = c(-10, 10), mlim = c(-50, 50))

make_mds_plots_n(betas_auto, metadata_sex, plot = "vape", lim = c(-10, 10), nump = 10000)

The previous version of this report (“Methylation_QC_2022_09_08.html”) saw no clustering by vape status. After the addition of ‘BMIQ’ normalization, there may be some weak clustering by vape status. It is also worthwhile to check for clustering along further MDS coordinates.

Figure 8: Vape Status Clustering (MDS 3 & 4)

make_mds_plots_n(betas, metadata_sex, plot = "vape", nump = 1000, lim = c(-5, 5), mlim = c(-15, 15), n_mds = 4)

make_mds_plots_n(betas, metadata_sex, plot = "vape", nump = 10000, lim = c(-5, 5), mlim = c(-15, 15), n_mds = 4)

Looking at MDS coordinates 3 and 4 for vape status, there is still no obvious clustering pattern. This may indicate technical effects still present in the data that were not handled well by ‘noob + bmiq’ normalization, or it could indicate a lack of biological signal. It is hard to say at this stage, but perhaps other normalization procedures are necessary.

Figure 8: Sample Clustering (Recruitment Center)

make_mds_plots_n(betas, metadata_sex, plot = "center", nump = 1000)

make_mds_plots_n(betas, metadata_sex, plot = "center", nump = 10000, lim = c(-100, 100), mlim = c(-100, 100))

There is no visual clustering by recruitment center.

RUVm

For RUVm, two methods were considered. The first was the standard implementation from missMethyl. This method estimated all of the probes to act as empirical controls (see Adjusted: RUV-inverse below). The methods that implement “RUV-4” appeared to be more robust for k = 1 and k = 2 factors.

Figure 9: K-Factor Selection

From the elbow plot above, we can see that the total within-cluster sum of squares gains great benefit from k = 1 to k = 2 and then levels off. The following plots will consider k = 1 and k = 2 factors for inclusion in the model.

Figure 10: Clustering by Vape Status after RUVm

Figure 10 shows pca plots after pursuing several versions of RUVm. The RUV-inverse method proved to be problematic as it estimated all probes to be empirical controls and attributes 100% of the variability in the data to the first PC. The bottom two plots present PCA plots for a variation of RUVm which uses ‘RUV-4’ as opposed to RUV-inverse in the second round of empirical control estimation. This method performed favorably over RUV-inverse. There is noticibly better clustring by vape status in both the K = 1 and K = 2 plots.

Recruitment Center


There is no visual clustering by recruitment center, however it will still be adjusted for in the model due to it’s confounding relationship with vape status.

Conclusions

Of the normalization methods that were attempted for these data, this report shows that ‘noob + BMIQ + dbNL’ is a favorable method based on the distribution of Infinium I and II beta values and dye bias diagnostics. The review of sample quality and clustering patterns returned one subject (SID = 127) that fell in the 1st percentile for mean intensity. That sample did not show up as an outlier in MDS plots after normalization and will be included for further analyses. There is also a clear sex effect in the data that will be accounted for by including sex as a model covariate. Up to this point, MDS and PCA plots show notable clustering by vape status and not by recruitment center. Both covariates will still be included in the final model to account confounding.

References

Liu, Jie, and Kimberly D. Siegmund. 2016. “An Evaluation of Processing Methods for HumanMethylation450 BeadChip Data.” BMC Genomics 17 (1). https://doi.org/10.1186/s12864-016-2819-7.
Triche, Timothy J., Daniel J. Weisenberger, David Van Den Berg, Peter W. Laird, and Kimberly D. Siegmund. 2013. “Low-Level Processing of Illumina Infinium DNA Methylation BeadArrays.” Nucleic Acids Research 41 (7): e90–90. https://doi.org/10.1093/nar/gkt090.
Zhou, Wanding, Timothy J Triche, Peter W Laird, and Hui Shen. 2018. “SeSAMe: Reducing Artifactual Detection of DNA Methylation by Infinium BeadChips in Genomic Deletions.” Nucleic Acids Research, July. https://doi.org/10.1093/nar/gky691.